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Abstract 

We present teraflop-scale calculations of biomolecular electrostatics enabled by the combination of algorithmic and hard- 
ware acceleration. The algorithmic acceleration is achieved with the fast multipole method (fmm) in conjunction with a 
boundary element method (bem) formulation of the continuum electrostatic model, as well as the bibee approximation 
to BEM. The hardware acceleration is achieved through graphics processors, GPUs. We demonstrate the power of our 
algorithms and software for the calculation of the electrostatic interactions between biological molecules in solution. 
The applications demonstrated include the electrostatics of protein-drug binding and several multi- million atom systems 
consisting of hundreds to thousands of copies of lysozyme molecules. The parallel scalability of the software was studied 
in a cluster at the Nagasaki Advanced Computing Center, using 128 nodes, each with 4 GPUs. Delicate tuning has 
resulted in strong scaling with parallel efficiency of 0.8 for 256 and 0.5 for 512 GPUs. The largest application run, with 
over 20 million atoms and one billion unknowns, required only one minute on 512 GPUs. We are currently adapting our 
BEM software to solve the linearized Poisson-Boltzmann equation for dilute ionic solutions, and it is also designed to be 
flexible enough to be extended for a variety of integral equation problems, ranging from Poisson problems to Helmholtz 
problems in electromagnetics and acoustics to high Reynolds number flow. 

J Keywords: bioelectrostatics, fast multipole method, boundary element method, integral equations, graphics 
processors, GPU 
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1. Introduction 

Electrostatic interactions play an essential role in the 
structure and function of biomolecules (proteins, DNA, cell 
membranes, etc.) [73, 83]. One of the most challenging 
aspects for understanding these interactions is the fact 
that biologically active molecules are almost always in 
solution — that is, they are surrounded by water molecules 
and dissolved ions. These solvent molecules add many 
thousands or even millions more degrees of freedom to any 
theoretical study, many of which are of only secondary im- 
portance for investigations of interest. Classical molecular 
dynamics (md) methods, implemented in software libraries 
such as CHARMM [17] and namd [61], use all-atom repre- 
sentations of the biomolecule and solvent, and compute the 
trajectories of every atom over time by following Newton's 
equations of motion. MD methods are the most detailed 
approach to studying biomolecular systems, but comput- 
ing an "average" electrostatic energy in such systems can 
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be extremely expensive, even when one uses efficient meth- 
ods for the long-range electrostatics between all the atoms, 
such as particle-mesh Ewald [26] or multilevel summation 
[76]. For many studies, a faster method based on adequate 
approximations is a necessity. 

In contrast with all-atom MD computations, one can 
model the electrostatic interactions in solvated molecules 
using a continuum representation. Such a model is based 
on assuming that the molecules and the solvent can be 
treated as continuous dielectric media with different di- 
electric constants, low and high, respectively. Inside the 
biomolecule, in addition, point charges are arranged ex- 
plicitly at the atomic positions. Thus, the electrostatic 
potential can be described by a Poisson equation, which for 
general shapes of the dielectric boundary has to be solved 
numerically. Accounting for ionic charge distributions in 
the solvent introduces some extra complications, as the ion 
locations depend on the combined effect of all charges, di- 
electric distributions, and the ions themselves. Making the 
assumption that the average electrostatic potential multi- 
plied by the charge of the ion determines the mean force 
acting on the ion particle, a Poisson-Boltzmann model 
for biomolecular systems is obtained [73]. Standard nu- 
merical approaches such as finite- difference methods and 
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finite-element metiiods can be used to solve the Poisson 
or Poisson-Boltzmann equations [84, 37, 7], but there are 
several challenges that must be overcome. These include 
the difficulty of mapping an irregular molecular surface to 
a volumetric mesh, representing the source distribution as 
a set of discrete point charges, and convergence issues as- 
sociated with dielectric discontinuities. The development 
of various strategies to mitigate these problems (see, e.g.^ 
[18, 34]) and the availability of scalable, open-source soft- 
ware such as APBS [7] have helped make the continuum 
model a popular approach for studying molecular electro- 
statics. 

An alternative approach to solving the Poisson equa- 
tion directly is to use a boundary-integral formulation and 
the boundary-element method, BEM, to determine the in- 
duced charge distribution on the molecular surface which 
accounts for the change in polarization charge across the 
dielectric boundary. One significant advantage of a rem 
formulation is the fact that the discretization is performed 
over the surface of the biomolecule, rather than the three- 
dimensional volumetric region occupied by the molecule 
and solvent. Accurate solution of the linearized form of 
the Poisson-Boltzmann equation is also possible with BEM 
using various specialized techniques [45, 90, 50, 3]. The 
main challenge in the BEM is the computational cost of 
finding the induced-charge distribution, which is obtained 
by solving a linear system in which the matrix is dense (in 
contrast to the sparse linear systems associated with finite- 
difference and finite-element methods). To greatly reduce 
the expense of solving such a linear system using Krylov 
iterative methods such as gmres [69], the fast multipole 
method (fmm) [66, 39, 57] can be used to calculate the 
dense matrix- vector product needed in the iterative solver 
in 0{N) operations [14]. In this way, the fmm algorithm 
and similar approaches [62, 2] can enable calculations with 
hundreds of thousands or even millions of degrees of free- 
dom {e.g. [14, 46, 50]). 

In this paper, we demonstrate a fast molecular elec- 
trostatics application using a BEM formulation of the con- 
tinuum model. The application is accelerated both at the 
algorithmic level and by hardware, achieving an unprece- 
dented capacity for simulating large biomolecular systems. 
Our largest example models a collection of biomolecules, 
totalling more than 20 million atoms, for which the BEM 
problem has more than one billion unknowns. It was ac- 
complished using a cluster of 128 nodes with a total of 512 
GPUs and required approximately one minute to complete. 
Thus, the electrostatic interactions between large proteins 
and molecular machines with tens to hundreds of thou- 
sands of atoms can be simulated in a few minutes on a 
single GPU or small GPU cluster. This computational per- 
formance transforms the landscape for theoretical inves- 
tigations of biomolecular electrostatics such that the new 
limiting factor is the generation of suitable meshes for the 
BEM, rather than fast solver approaches. We wish to em- 
phasize that we are referring to calculations of static struc- 
tures (for example, as in the post-processing stages of MD- 



based estimates of binding free energies using mm-pbsa 
methods [53]) rather than actual dynamics. Substantial 
hurdles remain before BEM electrostatics can be practical 
for implicit-solvent dynamics, including both meshing and 
the calculation of suitable forces [51, 79]. Our solver is cur- 
rently experimental, but already available publicly via the 
repository of the open-source PetFMM library^, which pro- 
vides a parallel implementation of the fmm with dynamic 
load balancing [25]. The GPU kernel implementations for 
the GPU architecture are derived from the 2009 Gordon 
Bell prize-winning FMM code presented in [40]. 

There have been some notable recent publications re- 
porting GPU-accelerated algorithms related to our present 
contribution. The multilevel summation method [76] for 
calculating electrostatic potentials is implemented for the 
GPU architecture in [41]. In this algorithm, the short-range 
interactions are equivalent to the particle-to-particle (p2p) 
operations in the fmm, while the long-range forces are ob- 
tained from hierarchical interpolations; both the short- 
range interactions, and the dominant part of the long- 
range ones were implemented as GPU kernels, achieving an 
overall speed-up of 26 x over the CPU version. However, 
we note that the algorithm is completely dominated by the 
short-range pair-wise interactions, which on the CPU take 
90% of the runtime and on the GPU take 75%, as shown 
in Table 1 of [41]. In this respect, the approach is quite 
different from the one followed with the fmm, where one 
aims for an optimal balance between the 0{N'^) p2p and 
the multipole-to-local (m2l) transformation (see Figure 2 
for definitions of the fmm operations). The demonstration 
runs in [41] use a 1.5 million atom water box, with a total 
runtime of 20 s on one GTX 280 GPU. 

Another related work to the present contribution was 
presented in [47], where a variant of the fmm called the 
kernel-independent fast multipole method [85] is imple- 
mented for multi-CPU systems. That algorithm and imple- 
mentation was demonstrated on an impressive 256 million 
point-system, taking 2 s to compute the interaction on 256 
CPUs. The test consisted of a uniform random distribution 
of particles in a unit cube (tests with non-uniform distri- 
butions are also reported, but on CPU-only systems), and 
a consistent speed-up measure of 25 x is obtained from the 
GPU on weak scaling tests. This GPU-accelerated kernel- 
independent FMM has been recently used in an award- 
winning application to simulation of red-blood cells [64]. 
Finally, in [80] a GPU-accelerated BEM for the Helmholtz 
equation is presented and demonstrated up to one million 
unknowns. However, this work does not use algorithmic 
acceleration by means of the fmm, performing instead the 
0{N'^) method. 

Even with linear-scaling algorithms such as multigrid 
approaches for finite-element methods [7] and fmm for 
BEM, the computational costs of solving the continuum 
electrostatic model can be prohibitive, especially when 
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the scientific questions of interest require the simulation 
of many mihions of unknowns. Such investigations in- 
clude in silico screening of candidate drugs [49], protein 
design [38, 81], and implicit-solvent MD, in which one com- 
putes the trajectories of all protein atoms but replaces 
the effect of solvent molecules with an appropriate poten- 
tial of mean force [36]. These applications have driven 
the development of much faster approximate models such 
as the popular generalized-Born (gb) models [78] which 
are frequently used in implicit-solvent molecular dynamics, 
e.g. [31]. Recently, more innovative approaches have also 
been described, many of which also employ CPUs [32, 4]. 
In the present paper, we approximate the bem solution us- 
ing the new bibee (boundary-integral-based electrostatics 
estimation) model [8, 12], which can be more than an order 
of magnitude faster than bem simulation; thus, bibee is 
well-suited for rapid screening of candidate drug molecules 
(a common application of solvers such as APBS [7]), but like 
BEM it is not presently applicable to computing dynamics. 
Modern GB models and implementations have been exten- 
sively parameterized and optimized, giving them certain 
performance and accuracy advantages over early, unopti- 
mized implementations of bibee when computing electro- 
static solvation free energies. However, bibee does offer 
several of its own advantages, including the fact that it al- 
lows scientists to fully leverage well-known numerical tech- 
niques such as FMM, rather than necessitating new imple- 
mentation. The ability to employ sophisticated computa- 
tional primitives such as FMM as an algorithmic substrate 
is particularly important in the face of rapidly evolving 
hardware architectures. Thus, there remains a healthy 
(and in our view, productive) tension between efforts to 
develop general fast methods, one of which we apply in 
this work, and efforts to specialize existing fast methods 
for particular problems such as biomolecular electrostat- 
ics [32, 4]. 

This contribution aims primarily at demonstrating the 
power of multiplying speedups: fast linear-scaling algo- 
rithms, rigorous approximation of the continuum dielec- 
tric model using bibee, and hardware acceleration with 
GPUs. The first wave of successful applications of the GPU 
in scientific computing was crowded with highly parallel al- 
gorithms (like md) which fit well to the hardware architec- 
ture. As could be expected, it is generally much more dif- 
ficult to obtain high performance with the more elaborate 
hierarchical and 0{N) algorithms. But it is in this com- 
bination where leaps in computing capability are possible 
which are orders of magnitude larger than what Moore's 
law would achieve in a given period. In the bioelectrostat- 
ics application, the multiplying speedups of algorithm and 
hardware are further enhanced by a mathematical model 
that does not lavish computational effort on nonessential 
degrees of freedom (the solvent molecules). Many applica- 
tions may still require detailed modeling at the microscale, 
but where the continuum approach gives sufficient accu- 
racy, the methods and software of our contribution can 
enable high-impact advances. 




Figure 1: In the continuum, implicit-solvent model for 
biomolecular electrostatics, the interior of the biomolecule 
is a dialectric with permittivity e/ and free charges qi, 
while the exterior, solvent-filled region is a dialectric ma- 
terial with permittivity e//. 



2. Background on the models and algorithm 

2.1. Bioelectro statics using the continuum model 

The continuum electrostatic model we treat in this 
work is a mixed-dielectric Poisson problem, described as 
follows (see Figure 1). The molecular interior, denoted as 
region /, is treated as a homogeneous dielectric with per- 
mittivity 6/; typical values are between 2 and 10 [73, 71]). 
The molecular charge distribution is modeled as a set of 
Tic discrete point charges located at the atom centers. De- 
noting the i*^ charge as having value qi and position r^, 
the electrostatic potential in this region satisfies a Poisson 
equation. 
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The water surrounding the molecule, corresponding to 
region //, is also modeled as a homogeneous dielectric, of 
permittivity e// equal to that of bulk water (approximately 
80). In this region, the Laplace equation V'^^fnir) = 
holds, because we assume that there are no fixed or mo- 
bile charges. At the dielectric boundary, 1], the potentials 
and (in the absence of a free surface charge) the normal 
components of the electric displacement fields are contin- 
uous: 
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Here, n denotes the outward unit normal vector at r G 1^, 
pointing into region // from region /. We define the 
boundary by rolling a probe sphere (designed to mimic 
a water molecule) over the union of spheres that repre- 
sent the solute atoms, a definition known as the solvent- 
excluded surface (also known as the molecular surface) [65, 
24]. 

The solute charges polarize the solvent, which in turn 
creates a reaction potential in the solute. In the mixed- 
dielectric continuum model, the solvent polarization ap- 
pears as a layer of induced charge G(r) at the dielectric 
interface, where <j(r) satisfies the second-kind Fredholm 
boundary integral equation [68, 55, 74, 9]: 
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The reaction potential in the solute is then just the 
Coulomb potential induced by (j{r)^ i.e., 
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such that the total potential (pi{r) is the sum ofcp^^^^ and 
the bare Coulomb potential induced by the point charges. 
The electrostatic energy associated with the reaction po- 
tential represents the change in electrostatic energy asso- 
ciated with transferring the given charge distribution and 
molecular shape from a uniform low dielectric into the high 
dielectric medium. For this reason, the energy is called the 
solute's electrostatic solvation free energy /\Q^o\v,es^ ^^^ ^-^ 
may be written as 
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Electrostatic solvation free energies and the bare Coulomb 
energies can be used to estimate quantities such as electro- 
static contributions to protein stability [77] or the binding 
affinity between molecules [19]. 

One may also investigate the effect of dilute ionic so- 
lutions on a solute using either the linearized Poisson- 
Boltzmann (pb) equation V^(/?//(r) = n^(fii{r) to model 
the potential in the solvent, or the full nonlinear PB equa- 
tion [72] . The linearized PB problem can also be solved us- 
ing boundary integral equations [87, 45, 50]. For simplicity 
in presenting the FMM, we treat only the mixed-dielectric 
Poisson problem; however, our implementation can also 
solve Helmholtz problems and we are currently adapting 
the method for solving linearized Poisson-Boltzmann bem 
problems. 

The ffist step in using bem to solve Equation(4) is to 
divide the boundary Vt into n^ discrete, non-overlapping 
pieces, called panels or boundary elements. Often, for 
complicated geometries one approximates the boundary 
using easily defined boundary elements such as planar tri- 
angles, a practice we follow here. Our solver infrastructure 
supports the use of more accurate representations with 
curved boundary elements, as in [48, 3], which we intend 
to explore in future work. The second step in bem is to 
define the space of possible solutions; here, we define n^ 
piecewise-constant functions, the i^^ of which takes a value 
of 1 on element i and zero elsewhere. The third step is 
defining what constraints the approximate solution should 
satisfy [6, 9, 11]. Galerkin methods enforce the residual 
to be orthogonal to the basis set, whereas point colloca- 
tion methods enforce that the residual is exactly zero at 
specified points on the boundary. 



By enforcing n^ constraints, one obtains a square ma- 
trix equation 
= a(r). Ax = Bq, (7) 

where B is the rip-hy-nq dense matrix that maps the point 
charge values (the vector q) to the normal electric field that 
they induce at the surface, and A is the n^-by-n^ dense 
matrix associated with the second-kind integral operator 
of Equation(4). The unknown basis- function weights x 
are found in practice by solving the linear system using 
Krylov-subspace iterative methods such as gmres [69] and 
preconditioning, and instead of computing all the entries 
of A explicitly one uses fast-summation techniques such as 
the fast-multipole method (fmm) [66, 39, 57], precorrected- 
FFT [62], or FFTSVD [2] to compute dense matrix-vector 
products using only linear, 0(np), or near-linear time and 
memory [27]. After obtaining x, the vector of reaction po- 
tentials at the charge locations can be computed as the 
matrix- vector product ip^^^^ — Cx, where C is the Uq- 
hy-Up matrix resulting from discretizing the Coulomb op- 
erator of (5). Therefore, the overall electrostatic solvation 
free energy can be written as ^q^CA~^Bq. 

In this work, we use a Galerkin approach and the sim- 
plest possible quadrature rule — a single point, located at 
the center of the triangle — although the solver supports 
more sophisticated approaches. Representing a{r) using 
point charges is a common approach for estimating ener- 
gies, especially in quantum chemistry [23], and is adequate 
for the current demonstration of the GPU-accelerated fmm 
using BIBEE. For planar elements and polynomial basis 
functions, one may also compute some of these integrals 
analytically [42, 59] rather than by numerical quadrature, 
and implementation of these approaches is underway. 

2.2. BIBEE approximation to the continuum model 

The recently developed bibee (boundary-integral-based 
electrostatics estimation) models compute approximations 
to a molecule's electrostatic solvation free energy in the 
mixed-dielectric Poisson model [8, 12]. bibee calculations 
can be at least an order of magnitude faster than bem sim- 
ulation, and the approximations reproduce numerous im- 
portant characteristics of the actual Poisson model, which 
make bibee an appealing new approach for studying elec- 
trostatic interactions within and between large biological 
molecules. 

A BIBEE calculation approximates the solution of the 
boundary-integral equation approach in the previous sec- 
tion. As described earlier, the bem approach to calculat- 
ing the electrostatic solvation free energy requires three 
steps: the computation of Bq, the normal electric field at 
the boundary; solving the linear system Ax — Bq to ob- 
tain the induced surface charge distribution; and the com- 
putation of the reaction potentials, (^^^^^ = Cx. bibee 
approximates the second step by replacing the electric- 
field operator in (4) with a scaled identity operator. One 
therefore obtains an approximate surface charge density a 
rather than the actual solution a. We use a scale factor of 



— 1/2, which corresponds to an extremal eigenvalue of the 
electric-field operator [6]: 
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Other useful bibee approximations may be obtained by 
employing or +1/2 as scale factors, which are also im- 
portant eigenvalues for the integral operator [8, 12]. 

Numerical bibee calculations can be performed using 
simple modifications of bem implementations for solving 
(4). The bibee/cfa approximate solvation free energy is 
written as ^q^CD~^Bq^ where C and B are the same as 
in BEM, and D is a diagonal matrix [8]. bibee's speed 
advantage over bem simulation arises because D~^ is triv- 
ial to apply, whereas application of A~^, which entails the 
Krylov- iterative solve, is the most computationally expen- 
sive step in computing electrostatic solvation free energies 
using BEM. 

Using Eq. (8), the approximate electrostatic solvation 
free energy for a single point charge (setting all others to 
zero) is equal to the volume integral of the energy density 
of the Coulomb field produced by the lone charge, where 
the volume of integration includes the entire solute volume 
except for a spherical region associated with the atomic 
charge in question (which eliminates the singularity [63]). 
This solvation free energy, because it is associated with 
setting Qi = 1 for some i and the other charges to zero, 
is known as the i^^ self energy. The use of such a vol- 
ume integral to approximate a point charge's self energy is 
well-known as the Coulomb-field approximation [63], and 
therefore we refer to Eq. 8 as the bibee/cfa model. 

bibee/cfa provides new insights into the character 
of the CFA, including that the CFA is exact for charge 
distributions that generate uniform normal fields on the 
boundary [8] and that the bibee/cfa approximate elec- 
trostatic solvation free energy is an upper bound to the 
actual free energy that would be obtained by solving the 
Poisson model [12]. Ironically, although the CFA gives 
a mathematically rigorous approximation of the Poisson 
problem, until recently it had been relegated to comput- 
ing parameters for non-rigorous, purely empirical electro- 
static models in the generalized-Born methods [78, 63]. In 
fact, the CFA was originally introduced in volume-integral 
form to calculate the self-energies of point charges [63]. 
Despite the unphysical basis of GB methods, they have 
proven to be effective at approximating solvation free en- 
ergies. As a result, later research focused on calculating 
more accurate self-energies (improvements "beyond the 
Coulomb-field approximation" [67]) rather than on analy- 
sis of the CFA. Borgis et al. achieved a significant theoret- 
ical advance with their variational Coulomb-field approxi- 
mation [16], which allowed for the first time the treatment 
of multiple charges in the CFA, eliminating the need for the 
nonphysical approximations inherent in GB. Several years 
later, independent analysis of the boundary-integral equa- 
tion (4) led to the bibee models [8] . This latter work was 



inspired by the observation of the relationship between the 
CFA volume integral and Equation (4) for a spherical solute 
with a single central charge [35]. bibee/cfa appears to be 
the surface form of the earlier variational CFA of Borgis et 
al. 

2.3. Fast multipole method 

The fast multipole method is an algorithm that ac- 
celerates the computations required in A^-body problems, 
which are expressed as a sum of the form 
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f(yj) = '^CiK{yj,Xi). 
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Here, f{yj) represents a field value evaluated at a point ?/j, 
where the field is generated by the influence of sources lo- 
cated at the set of centers {xi}. The evaluation of the field 
at the centers themselves corresponds to the well-known 
A/'-body problem. Thus, {xi} is the set of source points 
with weights given by q, {i/j} the set of evaluation points, 
and IK(?/, x) is the kernel that governs the interactions be- 
tween evaluation and source points. Obtaining the field 
/ at all the evaluation points requires in principle 0{N'^) 
operations, if both sets of points have A" elements each. 
Fast summation algorithms aim at obtaining / approxi- 
mately with a reduced operation count, ideally 0{N). In 
our application, we use two types of fast summation: in 
the first, we replace the source points with a discretization 
of the induced charge on the molecular surface <j, evaluate 
the field at the same surface points, and use as the ker- 
nel the boundary integral operator in Equation 4. In the 
second type of fast summation, the source points are the 
point charges in the molecule and the field is evaluated at 
points on the surface Q. 

There are two main aspects to the fmm algorithm: one 
is related to the spatial hierarchy and tree data structure; 
the other, to the mathematical machinery needed to per- 
form approximations which reduce the number of opera- 
tions in exchange for accuracy. The spatial hierarchy refers 
to a successive sub-division of the three-dimensional spa- 
tial domain and construction of an associated oct-tree: the 
domain is initially divided in eight cells, which are then di- 
vided again, and so on, until the finest level of refinement 
or "leaf level" . The set of cells created by each sub-division 
is associated to the nodes at a given level of a tree struc- 
ture, such that near- and far-domains to each cell can be 
found by operations on the tree. The source points be- 
longing to cells in the far-domain are then considered as a 
cluster, with their infiuence represented by a series expan- 
sion in the process of evaluation. This representation of 
many sources collectively by series expansions allows the 
reduction of operations in the fmm to 0{N)^ with control- 
lable accuracy. 

One can illustrate the phases of the algorithm using 
a diagram of the tree associated to the spatial division, 
thereby directly relating the algorithm to the data struc- 
ture used by the FMM; see Figure 2. First, we need to 
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Figure 2: On a tree diagram, we can illustrate the organization of the algorithmic stages for the fmm: upward sweep ^ 
downward sweep^ and evaluation step. The upward sweep combines the p2m and m2m operations, downward sweep the 
m2l and l2l, and the evaluation combines l2p with a direct calculation in the near domain, p2p. 



introduce the terminology for the approximation of the 
kernel action at long and short distances, as follows. 

\> Multipole Expansion (ym): a series expansion trun- 
cated after p terms that represents the influence of 
a cluster of source points, and is valid at distances 
large with respect to the cluster radius. 

> Local Expansion (^le^: a truncated series expansion, 
valid only inside a sub-domain, that is used to effi- 
ciently evaluate a group of mes locally to a cluster 
of evaluation points. 

The computation of the action of the kernel using fmm 
proceeds in stages: an upward sweep^ downward sweep^ and 
evaluation stage. In the upward sweep ^ the objective is to 
build the mes for each node of the tree. The mes are built 
first at the tree leaves (p2m operation) and then translated 
to the center of the parent cells (m2m translation). This 
process is illustrated in Figure 2 by the black arrows point- 
ing up from the nodes on the left side of the figure. Notice 
that at each level above the leaves, mes are computed by 
shifting and then combining the MEs of the child cells, 
which results in a reduction by a factor 8 of the number of 
expansions that needs to be calculated. In the downward 
sweep of the tree, the mes are first transformed into les 
for all the boxes in the interaction list — a process repre- 
sented by the dashed red-colored arrows in Figure 2, and 
denoted by m2l. For a given cell, the interaction list cor- 
responds to the cells whose parents are neighbors of the 
given cell's parent, and yet not directly adjacent to the 
given cell. Each le is then translated to the centers of all 



child cells (l2l operation), and combined with the trans- 
formed MES from the child level to obtain the complete 
far-domain influence for each box. This process is repre- 
sented by the dashed arrows going down the right side of 
the tree in Figure 2. At the end of the downward sweep ^ 
each cell will have an le that represents its complete far 
field. Finally, at the evaluation stage^ the field is evaluated 
for every point contained in a cell by adding the near-field 
and far-field contributions: the near field contribution is 
obtained from directly computing the interactions between 
all the points in the adjacent cells, and the far-field con- 
tribution comes from evaluating the le of the cell at each 
point location. 

3. Multi-gpu fast multipole method 

3.1. Brief overview of the distributed fmm 

The distributed-memory parallelization of the fmm in- 
volves two main phases: (i) the partitioning of the tree 
among MPI processes; and, (ii) the communication of 
needed data to perform the fmm interactions. In the first 
phase, the standard and most commonly used technique is 
to equally distribute the Morton-indexed boxes at the leaf 
level [82]. This has been the approach used here. 

The second phase involves the communication that oc- 
curs between partitioned domains, due to interacting cells 
residing in different processes. Two cells interact when 
they are present in each other's neighbor list or interaction 
list (for the far-field), as defined in the previous section. 
In the case of the neighbors, communication only occurs 
among processes holding geometrically adjacent points. 
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Figure 3: L^-norm of the measured normalized error for 
the FMM evaluation versus direct evaluation on the GPU. 



and data consist of position and weights of the points or 
particles. For the far-field, the data that needs to be com- 
municated consists of ME coefficients of the cells in the in- 
teraction list, at every level of the tree. We overlap these 
communications with local computations, which results in 
effective hiding of communication time thereby enhancing 
scaling (as seen in the results section). Moreover, we have 
achieved linear scaling in memory requirements, which al- 
lows our large-scale computations of biomolecular electro- 
statics. 

3.2. Synopsis of GPU hardware utilization 

All the computational kernels of the FMM depicted in 
Figure 2 were reformulated to utilize the GPU architecture 
with CUDA [60], using single precision. The GPU kernels 
were verified against CPU calculations, and full FMM evalu- 
ations were also validated with respect to direct (all-pairs) 
evaluation. Figure 3 shows the measured error (L^-norm 
of the relative error) for a set of calculations with various 
numbers of points, comparing the FMM on GPU with direct 
evaluation. 

One particular feature of the GPU execution that we've 
incorporated in the code is an effective buffering of input 
and output data to each kernel, such that a kernel is always 
executed on sufficiently large data sets. In other words, a 
given kernel, e.g.^ m2l, is not executed for each of the in- 
teractions, but for a group of them. At the same time, 
data is made contiguous in memory before the kernel call 
and output memory is also made contiguous in the GPU 
buffer; these are essential measures to attain good perfor- 
mance. On the GPU, the memory for evaluation points is 
padded to match the size of a thread block. For example, 
if the number of multipole coefficients is 55 and the thread 
block size is 64, the code will insert zero padding for in- 
dices 56-64. This coalesced write to evaluation points was 
found to be much more efficient than using a compressed 
buffer. 



4. Computational results 

4.1. Hardware 

The calculations were performed using the Degima clus- 
ter at the Nagasaki Advanced Computing Center, which 
presently consists of 288 nviDiA GTX 295 cards, each with 
two GPUs. There are two cards per host node, amounting 
to 144 CPUs and 576 CPUs — however, in this work we have 
only used 128 nodes out of the total 144, as some of the 
nodes remain in experimental use. Figure 4 shows the con- 
figuration of the interconnect between the nodes. There 
are 36 nodes connected to each of the 6 QDR switches. 
The bandwidth of SDR is 10 Gbps and for the QDR it is 
40 Gbps. With 4 qdr networks, a total bandwidth of 160 
Gbps is achieved between the switches. Each circle in the 
Figure represents one compute node equipped with 1 CPU 
and 4 CPUs. 

4.2. Overview of computational experiments 

The calculations we report here demonstrate the speed 
and scalability of our FMM algorithm on realistic biomolec- 
ular problems, without any attempt at this point of offer- 
ing biological insights into our chosen examples. Much 
more detailed simulation and analysis are required to ob- 
tain meaningful understanding of these complex systems. 
Furthermore, it bears repeating that techniques such as 
explicit-solvent molecular dynamics offer much more de- 
tail and are to be preferred in some circumstances, but 
are not always practical for some types of problems. 

We first present the scalability of the FMM on a large 
number of GPUs to demonstrate the computational power 
of the present method. Subsequently, we study the con- 
vergence and accuracy of the FMM-based bem approxima- 
tion BIBEE using as a model problem a clinically relevant 
protein (implicated in cancer) bound to a small-molecule 
inhibitor. We then calculate the electrostatics of several 
multi- million atom systems by creating arrays of molecules 
of the protein lysozyme. This example is inspired by the 
pioneering work of McGuffee and Elcock [54], who were 
among the first to conduct atomistic-level simulations of 
concentrated protein solutions. This kind of large-scale, 
computationally demanding study is important because 
proteins in biological systems, especially inside cells, op- 
erate in crowded environments that can strongly infiuence 
protein behavior [56]. Computational expense has long 
been a severe constraint on scientists' ability to study such 
systems theoretically. 

Most investigations use implicit-solvent models in con- 
junction with Brownian dynamics (see, for example, [33]) 
to maximize the time scales that can be simulated at rea- 
sonable computational cost. Still, however, most studies 
have of necessity employed reduced models of the proteins 
(e.^., spheres) despite evidence that protein shape plays 
a significant role [58, 21]. Atomistic-level treatments have 
also been forced to use some approximations to the physics 
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Figure 4: Infiniband network configuration of the Degima cluster in Nagasaki. 



(approximating the electrostatics) to make the computa- 
tions feasible [29]. To our knowledge, our ability to com- 
pute these interactions rigorously in seconds is unprece- 
dented, and may enable more accurate studies than have 
been possible previously. 

The Appendix provides details regarding the prepara- 
tion of the protein structures as well as the surface dis- 
cretizations. All calculations were performed with e/ = 4 
and eji = 80. 

4-3. Scalability of the fmm on a GPU cluster 

In this section, we present and discuss results from a 
scalability study of the fmm on many GPUs. We investi- 
gate scalability under the more severe strong scaling sce- 
nario, where the amount of computational work in the ex- 
periments remains constant as the number of computing 
devices (here, GPU chips) is increased from 1 to 512. In 
addition to parallel efficiency, we analyze the breakdown 
of execution time by computational kernels, and also by 
CUDA states. The results show excellent parallel efficiency 
up to 256 GPUs, and subsequent decline as serial fraction 
and communication overheads start to dominate. Note 
that the 256 GPU chips offer a total of 61,440 CUDA cores; 
the case on 512 GPUs is in actual fact running several mil- 
lion simultaneous threads. 

It must be emphasized that achieving high parallel ef- 
ficiency on a cluster of GPUs is generally much more diffi- 
cult than on a cluster of CPUs. As pointed out above, each 
GPU chip is already a parallel device, in this case (GT200b 
GPu) offering 240 multi-threaded cores. This results in a 
high compute capability which makes it very difficult to 
hide the time required for communications. Moreover, the 
current technology does not provide an avenue for direct 
communication between GPU cards, and all data must be 
transfered to and from the host CPU first; communica- 
tion between nodes then occurs via standard MPI calls. 



Despite these serious challenges, we can report excellent 
parallel efficiency, as already mentioned. 

Experimental setup. The scalability study is based on a 
fixed-size problem with N = 10^ points placed randomly 
in a cubic volume. The order of the multipole expansions 
was set to p = 10, which results in 4 significant digits of 
accuracy, and spherical harmonic rotations are performed 
before each translation at 0{p^) cost. One fmm evaluation 
is performed (equivalent to one matrix- vector product, or 
one BIBEE evaluation, or the cost of one iteration in a bem 
solution) for each case, with one MPI process running per 
GPU. We run one MPI process per cluster node up to 
128 processes, and then two and four processes per node, 
respectively, in the 256- and 512-GPU cases (recall that 
each node has two dual-CPU cards). 

Results. As shown in Figure 5, parallel efficiency remains 
perfect up to 128 CPUs, is 78% for 256 CPUs, and 48% for 
512 CPUs. We observe "super-linear" scaling between 4 
and 32 CPUs, which we ascribe to better cache usage in 
the tree construction phase of the calculation, done on the 
CPU. This is evidenced by the breakdown of time shown in 
Figure 6, where the tree construction takes less per-process 
time when going from 1 to 2 and 4 CPUs. 

Figure 6 is a bar plot of the runtime multiplied by the 
number of GPUs (number of MPI processes), where each 
bar is also color-coded to indicate the time spent in each 
computational kernel, including tree construction (done on 
CPu) and MPI communications. Equal bar heights would 
indicate perfect strong scaling, so once again we observe 
super-linear scaling for 4-32 processes and efficiency de- 
cline for more than 128 GPUs. The breakdown helps to 
make these observations: tree construction on the CPU is 
overburdened on one process, possibly due to cache mis- 
use; as expected, the p2p and m2l kernels take the largest 
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Figure 5: Parallel efficiency in strong scaling of the fmm 
on multi-GPUs: TV = 10^. 
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Figure 6: Breakdown of the fmm calculation time corre- 
sponding to the runs in Figure 5, with N = 10^. 



fractions of runtime; the fraction of time for MPI commu- 
nications is minor, up until 256 processes; and, tree con- 
struction and communications become significant at 512 
processes. 

The tree construction consists of all the preprocessing 
required to execute the fmm kernels, including the Morton 
indexing and bookkeeping of the interaction list, taking as 
input the set of points already assigned to corresponding 
processes. The MPI communication time, on the other 
hand, consists of the time required to update the data of 
points in the local essential tree ( "mpisendp2p" ) and the 
time required to update the multipole expansions ("mp- 
isendm21"). In sum, the height of each bar in Figure 6 is 
the total wall-clock time it takes to perform one complete 
evaluation using the fmm (multiplied by the number of 
processes). 

The breakdown of the runtime can be also done by the 
phases of CUDA execution, as shown in Figure 7, rather 
than computational kernels. Of course, the communica- 
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Figure 7: Breakdown of the fmm calculation time corre- 
sponding to the run with N procs = 1 in Figure 6, with 
N = 10^ 



tion time and tree construction on the CPU do not change 
in this version; only the fraction of time between the dashed 
lines in the Figure is analyzed differently. We have one 
large fraction of time (maroon color) labeled "cudaKer- 
nel"; this is the cumulative time spent in CUDA execution 
state — expressed in the CUDA code by the triple triangu- 
lar brackets — for all the 6 computational kernels (shown 
in the left bar) combined. The fmm evaluation spends 
70% of the total execution time in this state. Next in im- 
portance is the fraction spent "buffering data" (in green), 
which refers to the reordering of data that is performed to 
improve coarse-grained data locality when executing the 
GPU kernels. The required data is streamlined into a buffer 
in the exact order that it will be accessed inside the GPU. 
Next, but already a minor fraction of the total time, is "cu- 
daMemcpy" (red), a label which is given to the total time 
of all data movement between the host memory and the 
device memory. This is the most important observation to 
make from the breakdown of timings in this manner: the 
total time of all data movement between GPU and CPU is 
a small fraction of the total time, whereas a large fraction 
of time is spent in the CUDA execution state. This is a fa- 
vorable situation in terms of GPU performance, and shows 
that the fmm, unlike many algorithms, can afford to move 
data to/from the GPU due to a high compute/data ratio. 

Discussion. We have shown strong scaling results, demon- 
strating excellent parallel efficiency of the fmm up to 256 
CPUs and acceptable efficiency at 512 CPUs. These results 
were obtained in an experimental cluster using gaming 
GPU cards^ . The configuration of the cluster has two dual- 
chip GPU cards per node, yet we used one GPU per node up 
to 128 nodes to allow for the whole bandwidth to be avail- 
able to each MPI process. In the 256- and 512-GPU cases, 
the bandwidth must be shared within the node, and so 
we applied a strategy to alleviate possible constraints due 
to this. A hierarchical all-to-all communication was im- 
plemented using MPI_Coinin_split, dividing the MPI com- 
municator in 4. Effectively we are splitting the communi- 
cator into an inter- node and an intra- node communicator. 



^The Degima cluster enabled some of the authors to be awarded 
the Gordon Bell prize in the price/performance category in 2009. 
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Figure 8: Rendering of CDK 2 bound to small molecule 
inhibitor, using atomic coordinates from PDB accession 
lOIT [5] and the software PyMOL. 



Thus, we first perform an intra-node MPI_Alltoallv, then 
an inter-node MPI_Alltoallv. We found that this strategy 
was able to speed-up the communication nearly 4 times at 
512 processes; and these are the final results presented in 
Figures 5 and 6. 

4-4- Protein-inhibitor binding calculations 

Understanding the interactions between inhibitors and 
their proteins can help lead to the design of more potent 
drugs with fewer side effects, and computational model- 
ing can be a valuable approach to study inhibitor-protein 
binding [44] . The first model problem that we will use here 
is a protein known as cyclin-dependent kinase 2 (cdk) and 
a small-molecule inhibitor. Cyclin-dependent kinases are 
involved in the control of the cell cycle and implicated in 
the growth of cancers [5] , and it is thought that developing 
tight-binding inhibitors of CDK proteins may be a valuable 
therapeutic approach to treating some types of tumors. 
The atomic structure of a CDK 2 protein bound to a novel 
small-molecule inhibitor was solved using X-ray crystal- 
lography and deposited in the Protein Data Bank [13, 5] 
(see Appendix). Figure 8 shows the drug in the binding 
site. 

As described in §2.2, the integral equation (4) is ap- 
proximated using bibee/cfa to estimate the contribution 
to binding affinity that is due to the polarization of the 
solvent around the protein and drug. We use a widely 
accepted, but somewhat simplistic, approach to estimat- 
ing binding affinity, in which the protein and inhibitor 
bind rigidly in the conformations shown in the crystal 
structure. It has been shown that more realistic mod- 
els, which account for conformational changes on binding, 
require highly accurate electrostatic simulations for con- 
vergence [3]. Thus, our solver with its GPU capability may 
be a valuable tool to improve drug modeling by reducing 
the computational cost of accounting for flexibility. For 
these types of calculations, it would be ideal to be able 
to demonstrate that the computed energies are correct to 
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Figure 10: The induced charge distribution at the dielec- 
tric boundary for the inhibitor-protein complex. Results 
are in electrons per square Angstrom. 



within 0.1 kcal/mol, which is roughly the accuracy of many 
experimental binding assays. 

Figures 9(a), (b), and (c) contain plots of the computed 
electrostatic solvation free energies for the unbound in- 
hibitor, unbound protein, and inhibitor-protein complex, 
as functions of the number of vertices N on the discretized 
surface. Figure 9(c) plots the solvation free energy of the 
complex minus the solvation free energies of the unbound 
species, i.e.. 
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Meshing becomes problematic for the protein and inhibitor- 
protein complex beyond a vertex density of 10/square Angstrom, 
corresponding to the last data point on Figure 9(d). Finer 
discretizations of the inhibitor surface can be generated 
and we have simulated meshes up to 36 vertices/square 
Angstrom (at which point the surface is discretized into 
19,998 elements), with convergence approaching the de- 
sired 0.1 kcal/mol level. This is consistent with earlier 
work assessing the accuracy of planar elements versus curved 
elements [10]. However, we note that here we have used 
a point-charge approximation of the molecular charge dis- 
tribution, rather than Galerkin or collocation bem, and 
have also used the bibee approximation rather than full 
BEM calculation. Figure 10 shows a plot of the calculated 
surface charge density for the complex, in electrons per 
square Angstrom. 

4-5. A model calculation for protein solutions 

To demonstrate the capability of the method and soft- 
ware on large problems, we use collections of randomly 
oriented lysozyme molecules arranged on a regular Carte- 
sian grid, mimicking the Brownian dynamics calculations 
performed by McGuffee and Elcock [54] at each time step. 
In practice, calculations on such length scales may benefit 
from alternative theories that are very efficient for these 
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Figure 9: (a)-(c): Convergence of calculated free energies as functions of the number of vertices N on the dielectric 
boundaries for the components indicated, (d): Estimated electrostatic solvation free energy contribution to the inhibitor- 
protein binding affinity; the number of vertices N is that in the inhibitor-protein complex mesh. All energies are in 
kcal/mol. 
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Figure 11: Rendering of the discretized surfaces of a collec- 
tion of 1000 randomly oriented lysozyme molecules quasi- 
scattered from a regular Cartesian grid. 
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Figure 12: The induced charge distribution for an isolated 
lysozyme molecule. The surface has been discretized into 
102,486 planar triangles. 



specialized problems, for instance the recent approaches of 
Onufriev et at. [32, 4], but for the present the size of the 
problem showcases the implementation's performance and 
parallel scaling. 

A collection of 1000 proteins is shown in Figure 11. The 
surface charge density for an isolated lysozyme molecule, 
computed using our method, is plotted in Figure 12. Of 
course, actually implementing a Brownian-dynamics method 
requires some special adaptation of bem [15, 52], which 
we have not implemented. The calculation did not employ 
periodic boundary conditions, which would also require 
further adaptation of the solver [86]. 

Figure 13 shows the computation times required for 
single-GPU BIBEE calculations of lysozyme arrays with dif- 
ferent numbers of proteins. As expected, the direct method 
scales quadratically with the number of boundary elements, 
and the fmm scales linearly. Note that bibee on a single 
gpu requires less than one second to compute elec- 
trostatics with up to one million unknowns. The 
largest simulation we have conducted consists of 10,648 
molecules, where each surface was discretized into 102,486 
elements. This calculation, which models more than 20 mil- 
lion atoms and possesses over one billion unknowns, 
required only about one minute per BEM iteration on 512 
nodes. 

Our new ability to rigorously simulate such large sys- 
tems enables the assessment and improvement of current 
heuristics. For example, one heuristic assumes that the 
electrostatic potential generated by each protein does not 
depend on the presence of surrounding proteins, an as- 
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Figure 13: Average times required to compute the dense 
matrix-vector products for the BIBEE model, using a di- 
rect method or the FMM on a single GPU, as a function 
of the total number of boundary elements in lysozyme ar- 
rays; each lysozyme surface was discretized into 102,486 
boundary elements. 
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sumption motivated by the high dielectric solvent, which 
might also contain mobile ions. In the boundary-integral 
equation formulation, this heuristic takes the form of com- 
puting the induced surface charge for a single isolated 
molecule and then employing that surface charge for each 
molecule in the solution, without solving the full bem prob- 
lem self-consistently. Our fast solver raises the possibility 
of more detailed checks of such assumptions, by enabling 
computationally feasible comparison between the heuristic 
energy and the actual energy computed by realizing a full 
BEM solution. 

5. Discussion 

We have built a boundary-element method (bem) solver 
using the bibee approximation on top of a fast-multipole 
method (fmm) code with hardware acceleration using CPUs. 
To demonstrate the solver's speed and scalability, we have 
simulated several problems in electrostatics which model 
the interactions between biological molecules, but it can 
also be used to address numerous other problems, includ- 
ing problems in electromagnetics and the vortex particle 
formulation in fluid dynamics. The combination of op- 
timal algorithms and modern hardware acceleration can 
lead to a qualitative shift in the methodology of computa- 
tional protein science, and improve our ability to compute 
meaningfully converged quantities to be compared with 
experiments. 

The performance demonstrated here will carry over to 
more sophisticated bem applications with better accuracy, 
so we emphasize that the present work should be seen 
as a demonstration of the performance offered by the al- 
gorithms and hardware, rather than as a demonstration 
of the particular electrostatic energies calculated by the 
present implementation. Improved methods include the 
use of collocation or Galerkin formulations [9] or curved 
rather than planar elements [48, 3]. The present paper 
provides motivation to incorporate such methods into our 
solver for fast, accurate computation of molecular elec- 
trostatics. As a second step beyond the implementation 
described here, we will extend our fmm kernels in order to 
treat linearized Poisson-Boltzmann problems [45, 50]. 

In summary, substantial development is needed before 
the solver can be thought of as mature enough to be used 
for scientific studies even for simulations of fixed struc- 
tures. Even with fmm on GPU it seems that the possibility 
of BEM-based implicit-solvent molecular dynamics remains 
quite distant, owing to the numerous unresolved theoret- 
ical issues. When coupled with higher-order bem tech- 
niques, however, the fast multipole method with GPU ac- 
celeration should enable the routine calculation of quanti- 
ties that were previously impractical due to computational 
cost. For example, it is now entirely feasible to conduct de- 
tailed electrostatic component analyses of protein-protein 
interactions in vital biological processes. Such studies [20] 
requiring thousands or tens of thousands of simulations, 
have been prohibitively expensive until now. To a large 



degree, the fmm eliminates actual computation as a lim- 
iting factor; the generation of suitable surface meshes, a 
topic that has received much attention recently [22, 89], 
appears to be one of the primary remaining challenges. In 
addition, the acceleration afforded by our fmm on GPU also 
makes it practical to sample many more protein confor- 
mations in accounting for molecular flexibility [88, 1, 28]. 
Finally, as the arrays of thousands of lysozyme molecules 
demonstrated, our method could allow scientists to study 
complex molecular environments with greater detail — for 
instance, biological solutions are quite "crowded" [30] and 
it is not yet clear how crowding impacts protein diffusion, 
conformation, and association [56, 91]. The GPU acceler- 
ated fast multipole method allows studying mixtures of 
large numbers of proteins and small molecules, and there- 
fore to study crowding in more detail. 

The version of the code used for this work unit is cur- 
rently available in an experimental branch of the PetFMM 
repository. Over time, functionality for BEM will be added 
to the main branch of PetFMM, but we make available 
our experimental code at the time of publication of the 
paper, consistent with the research group's policy of open- 
ness. To find the repository, follow the links provided in 
http://barbagroup.bu.edu/. 
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Appendix: Structure Preparation 

All molecular structures were downloaded from the Pro- 
tein Data Bank [13] (CDK 2-inhibitor accession code lOIT; 
hen egg-white lysozyme accession code IHEL). The psf- 
GEN module of VMD [43] was used to add missing atoms 
as well as appropriate patches at the N- and C-termini 
of all peptide chains. Atomic radii and partial charges 
were taken from the parse parameter set [75]. All surface 
meshes were generated using MSMS [70]. 
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